INVERSE INFERENCE ENGINE FOR HIGH PERFORMANCE WEB SEARCH

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority under 35 U.S.C. §119(e) to provisional patent application Ser. No. 60/125,714 filed Mar. 23, 1999.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

The development of this invention was supported at least in part by the United States National Institutes of Health (NIH) in connection with Small Business Innovation Research Grant 5 R44 CA6161-03, and by the United States Defense Advanced Research Project Agency (DARPA) in connection with Small Business Innovation Research Contract DAAH01-99-C-R162. Accordingly, the United States Government may have certain rights in the present invention.

BACKGROUND OF THE INVENTION

The present invention relates generally to computer-based information retrieval, and more particularly to a system and method for searching databases of electronic text.

The commercial potential for information retrieval systems that can query unstructured text or multimedia collections with high speed and precision is enormous. In order to fulfill their potential, collaborative knowledge based systems like the World Wide Web (WWW) must go several steps beyond digital libraries, in terms of information retrieval technology. In order to do so, unstructured and heterogeneous bodies of information must be transformed into intelligent databases, capable of supporting decision making and timely information exchange. The dynamic and often decentralized nature of a knowledge sharing environment requires constant checking and comparison of the information content of multiple databases. Incoming information may be up-to-date, out-of-date, complementary, contradictory or redundant with respect to existing database entries. Further, in a dynamic document environment, it is often necessary to update indices and change or eliminate dead links. Moreover, it may be desirable to determine conceptual trends in a document set at a particular time. Additionally, it can be useful to compare the current document set to some earlier document set in a variety of ways.

As it is generally known, information retrieval is the process of comparing document content with information need. Currently, most commercially available information retrieval engines are based on two simple but robust metrics: exact matching or the vector space model. In response to an input query, exact-match systems partition the set of documents in the collection into those documents that match the query and those that do not. The logic used in exact-match systems typically involves Boolean operators, and accordingly is very rigid: the presence or absence of a single term in a document is sufficient for retrieval or rejection of that document. In its simplest form, the exact-match model does not incorporate term weights. The exact-match model generally assumes that all documents containing the exact term(s) found in the query are equally useful. Information retrieval researchers have proposed various revisions and extensions to the basic exact-match model. In particular, the "fuzzy-set" retrieval model (Lopresti and Zhou, 1996, No. 21 in Appendix A) introduces term weights so that documents can be ranked in decreasing order relative to the frequency of occurrence of those weighted terms.

The vector space model (Salton, 1983, No. 30 in Appendix A) views documents and queries as vectors in a high-dimensional vector space, where each dimension corresponds to a possible document feature. The vector elements may be binary, as in the exact-match model, but they are usually taken to be term weights which assign "importance" values to the terms within the query or document. The term weights are usually normalized. The similarity between a given query and a document to which it is compared is considered to be the distance between the query and document vectors. The cosine similarity measure is used most frequently for this purpose. It is the normalized inner product between vector elements:

$$\cos(q, D_i) = \frac{\sum_j w_{qj}\, w_{dj}}{\|q\|\,\|D_i\|}$$

where q is the input query, $D_i$ is a column in the term-document matrix, $w_{qj}$ is the weight assigned to term j in the query, and $w_{dj}$ is the weight assigned to term j in document i. This similarity gives a value of 0 when the document and query have no terms in common and a value of 1 when their vectors are identical. The vector space model ranks the documents based on their "closeness" to a query. The disadvantages of the vector space model are the assumed independence of the terms and the lack of a theoretical justification for the use of the cosine metric to measure similarity. Notice, in particular, that the cosine measure is 1 only if $w_{qj} = w_{dj}$. This is very unlikely to happen in any search, however, because of the different meanings that the weights w often assume in the contexts of a query and a document index. In fact, the weights in the document vector are an expression of some statistical measure, like the absolute frequency of occurrence of each term within a document, whereas the weights in the query vector reflect the relative importance of the terms in the query, as perceived by the user.

For any given search query, the document that is in fact the best match for the actual information needs of the user may employ synonyms for key concepts, instead of the specific keywords entered by the user. This problem of "synonymy" may result in a low similarity measure between the search query and the best match article using the cosine metric. Further, terms in the search query have meanings in the context of the search query which are not related to their meanings within individual ones of the documents being searched. This problem of "polysemy" may result in relatively high similarity measures for articles that are in fact not relevant to the information needs of the user providing the search query, when the cosine metric is employed.

Some of the most innovative search engines on the World Wide Web exploit data mining techniques to derive implicit information from link and traffic patterns. For instance, Google and CLEVER analyze the "link matrix" (hyperlink structure) of the Web. In these models, the weight of the result rankings depends on the frequency and authority of the links pointing to a page. Other information retrieval models track user's preferences through collaborative filtering, such as technology provided by Firefly Network, Inc., LikeMinds, Inc., Net Perceptions, Inc., and Alexa Internet, or employ a database of prior relevance
judgements, such as technology provided by Ask Jeeves,
Inc. The Direct Hit search engine offers a solution based on
popularity tracking, and looks superficially like collabora-
tive filtering (Werbach, 1999, No. 34 in Appendix A).
Whereas collaborative filtering identifies clusters of asso-
ciations within groups, Direct Hit passively aggregates
implicit user relevance judgements around a topic. The
InQuery system (Broglio et al, 1994, No. 8 in Appendix A;
Rajashekar and Croft, 1995, No. 29 in Appendix A) uses
Bayesian networks to describe how text and queries should
be modified to identify relevant documents. InQuery focuses
on automatic analysis and enhancement of queries, rather
than on in-depth analysis of the documents in the database.

While many of the above techniques improve search
results based on previous users' preferences, none attempts
to interpret word meaning or overcome the fundamental
problems of synonymy, polysemy and search by concept.
These are addressed by expert systems consisting of elec-
tronic thesauri and lexical knowledge bases. The design of
a lexical knowledge base in existing systems requires the
involvement of large teams of experts. It entails manual
concept classification, choice of categories, and careful
organization of categories into hierarchies (Bateman et al,
1990, No. 3 in Appendix A; Bouad et al, 1995, No. 7 in
Appendix A; Guarino, 1997, No. 14 in Appendix A; Lenat
and Guha, 1990, No. 20 in Appendix A; Mahesh, 1996, No.
23 in Appendix A; Miller, 1990, No. 25 in Appendix A;
Mahesh et al, 1999, No. 24 in Appendix A; Vogel, 1997 and
1998, Nos. 31 and 32 in Appendix A). In addition, lexical
knowledge bases require careful tuning and customization to
different domains. Because they try to fit a preconceived
logical structure to a collection of documents, lexical knowl-
edge bases typically fail to deal effectively with heteroge-
neous collections such as the Web. By contrast, the approach
known as Latent Semantic Indexing (LSI) uses a data driven
solution to the problem of lexical categorization in order to
deduce and extract common themes from the data at hand.
LSI and Multivariate Analysis 

Latent Semantic Analysis (LSA) is a promising departure
from traditional models. The method attempts to provide
intelligent agents with a process of semantic acquisition.
Researchers at Bellcore (Deerwester et al, 1990, No. 10 in
Appendix A, U.S. Pat. No. 4,839,853; Berry et al, 1995, No.
5 in Appendix A; Dumais, 1991, No. 11 in Appendix A;
Dumais et al, 1998, No. 12 in Appendix A) have disclosed
a computationally intensive algorithm known as Latent
Semantic Indexing (LSI). This is an unsupervised classifi-
cation technique based on Singular Value Decomposition
(SVD). Cognitive scientists have shown that the perfor-
mance of LSI on multiple-choice vocabulary and domain
knowledge tests emulates expert essay evaluations (Foltz et
al, 1998, No. 13 in Appendix A; Landauer and Dumais,
1997, No. 16 in Appendix A; Landauer et al., 1997, 1998a
and 1998b, Nos. 17, 18 and 19 in Appendix A; Wolfe et al,
1998, No. 36 in Appendix A). LSI tries to overcome the
problems of query and document matching by using statis-
tically derived conceptual indices instead of individual terms
for retrieval. LSI assumes that there is some underlying or
latent structure in term usage. This structure is partially
obscured through variability in the individual term attributes
which are extracted from a document or used in the query.

A truncated singular value decomposition (SVD) is used to
estimate the structure in word usage across documents.
Following Berry et al (1995), No. 5 in Appendix A, let D be
an m×n term-document or information matrix with m>n,
where each element $d_{ij}$ is some statistical indicator (binary,
term frequency or Inverse Document Frequency (IDF)
weights; more complex statistical measures of term distri-
bution could be supported) of the occurrence of term i in a
particular document j, and let q be the input query. LSI
approximates D as

$$D' = U_k \Lambda_k V_k^T$$

where $\Lambda_k = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$, and $\{\lambda_i,\ i = 1, \ldots, k\}$ are the first k
ordered singular values of D, and the columns of $U_k$ and $V_k$
are the first k orthonormal eigenvectors associated with $DD^T$
and $D^TD$ respectively. The weighted left orthogonal matrix
provides a transform operator for both documents (columns
of D') and q:

$$V_k^T = \Lambda_k^{-1} U_k^T D', \qquad \alpha = \Lambda_k^{-1} U_k^T q$$

The cosine metric is then employed to measure the similarity
between the transformed query $\alpha$ and the transformed docu-
ment vectors (rows of $V_k$) in the reduced k-dimensional
space.
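
By way of illustration only, the truncated-SVD ranking described above can be sketched in a few lines of Python. The sketch assumes NumPy is available; the function name `lsi_rank`, the toy matrix, and the choice k=2 are hypothetical and are not taken from the cited references.

```python
import numpy as np

def lsi_rank(D, q, k):
    """Cosine similarity between a query and each document in a rank-k LSI space.

    D is an (m terms x n documents) array and q is a length-m query vector.
    """
    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    alpha = (U_k.T @ q) / s_k            # transformed query: Lambda_k^{-1} U_k^T q
    docs = Vt_k.T                        # row j is the transformed document j
    denom = np.linalg.norm(docs, axis=1) * np.linalg.norm(alpha)
    return docs @ alpha / np.where(denom == 0, 1.0, denom)

# Toy matrix (illustrative): rows are the terms car, automobile, engine, flower;
# columns are four documents.
D = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])
q = np.array([1.0, 0.0, 0.0, 0.0])       # query containing only "car"
scores = lsi_rank(D, q, k=2)
```

In this toy example the second document contains "automobile" and "engine" but not "car", yet it scores close to 1 in the reduced space, while the document about "flower" scores close to 0.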

Computing SVD indices for large document collections 
may be problematic. Berry et al (1995), No. 5 in Appendix 
A, report 18 hours of CPU time on a SUN SPARC 10 
workstation for the computation of the first 200 largest 
singular values of a 90,000 terms by 70,000 document 
matrix. Whenever terms or documents are added, two alter- 
natives exist: folding-in new documents or recomputing the 
SVD. The process of folding-in documents exploits the 
previous decomposition, but does not maintain the orthogo- 
nality of the transform space, leading to a progressive 
deterioration in performance. Dumais (1991), No. 11 in 
Appendix A, and O'Brien (1994), No. 26 in Appendix A, 
have proposed SVD updating techniques. These are still 
computationally intensive, and certainly unsuitable for real- 
time indexing of databases that change frequently. No fast 
updating alternative has been proposed for the case when 
documents are removed. 

Bartell et al. (1996), No. 2 in Appendix A, have shown 
that LSI is an optimal special case of multidimensional 
scaling. The aim of all indexing schemes which are based on 
multivariate analysis or unsupervised classification methods 
is to automate the process of clustering and linking of 
documents by topic. An expensive precursor was the method 
of repertory hypergrids, which requires expert rating of 
knowledge chunks against a number of discriminant traits 
(Boose, 1985, No. 6 in Appendix A; Waltz and Pollack, 
1985, No. 33 in Appendix A; Bernstein et al., 1991, No. 4 in 
Appendix A; Madigan et al, 1995, No. 22 in Appendix A). 
Unfortunately, experience with automated techniques has 
shown that the user cannot readily associate transform axes 
with semantic meaning. In particular, open statistical issues 
in LSI are: (i) determining how many eigenvectors one 
should retain in the truncated expansion for the indices; (ii) 
determining subspaces in which latent semantic information 
can be linked with query keywords; (iii) efficiently compar- 
ing queries to documents (i.e., finding near neighbors in 
high-dimension spaces); (iv) incorporating relevance feed- 
back from the user and other constraints. 

For these reasons, it would be desirable to have an 
information retrieval system which addresses the various 
shortcomings of existing systems, including problems asso- 
ciated with the synonymy, polysemy, and term weighting 
limitations of those existing systems which employ the 
cosine metric for query to document comparisons. 
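
The synonymy and polysemy limitations of the cosine metric discussed in this background can be made concrete with a small sketch. The vocabulary, the vectors, and the helper name `cosine` below are hypothetical and chosen only for illustration.

```python
import math

def cosine(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    nq = math.sqrt(sum(a * a for a in q))
    nd = math.sqrt(sum(b * b for b in d))
    return dot / (nq * nd) if nq and nd else 0.0

# Illustrative vocabulary: car, automobile, engine, bank, river
query_car = [1, 0, 0, 0, 0]
doc_exact = [1, 0, 0, 0, 0]        # uses the word "car"
doc_synonym = [0, 3, 2, 0, 0]      # relevant, but says "automobile" instead

query_bank = [0, 0, 0, 1, 0]       # user means a financial institution
doc_river = [0, 0, 0, 2, 2]        # article about a river bank

synonymy_score = cosine(query_car, doc_synonym)
polysemy_score = cosine(query_bank, doc_river)
```

Here the relevant synonym document receives a similarity of zero, while the irrelevant river-bank article receives a fairly high similarity purely because it shares the surface term "bank".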

BRIEF SUMMARY OF THE INVENTION 

In accordance with the present invention, a new system
and method for latent semantic based information retrieval
is disclosed, which advantageously employs aspects of the
Backus-Gilbert method for inversion, thus eliminating the
need for Singular Value Decomposition (SVD). More
specifically, the disclosed system recasts measurement of the
similarity between a query and a number of document
projections as a constrained optimization problem in a linear
transform space.

In an illustrative embodiment, the present system per- 
forms a number of document processing steps to pre-process 
the documents in the set of searchable documents, in order 
to generate a representation of the search space. The system 
further performs a number of query processing steps to 
process a search query received from a user to generate a 
query vector for the query. The disclosed system then 
performs a measurement of the similarity between the query 
and document projections as a constrained optimization 
problem in a linear transform space. The algorithm and 
mode of solutions are major differences of the disclosed 
system with respect to the aforementioned vector space and 
LSI approaches. An additional, major conceptual difference 
of this approach, with regard to LSI, is that the similarity 
measurement is not a sequence of two independent steps 
consisting of: 1) decomposing or transforming the term- 
document matrix in a lexical transform space defined by the 
SVD of such matrix and 2) measuring the similarity between 
each query input by the user and each document projection in
the fixed transform space determined by the SVD. Instead, 
the disclosed system, in response to each new query input by 
the user, determines a new lexical transform space, based on 
algebraic and computational principles different from SVD, 
in which to perform the similarity measurement. The decom- 
position or transformation of the term-document matrix and 
measurement of similarity are carried out simultaneously in 
the solution of the constrained optimization problem. This 
approach brings a dramatic improvement in computational 
speed. It also provides important conceptual advantages 
over the unsupervised classification process implied by LSI. 
These advantages include the ability of the search engine to 
interact with the user and suggest concepts that may be 
related to a search, the ability to browse a list of relevant 
documents that do not contain the exact terms used in the 
user query, and support for an advanced navigation tool. 

The disclosed system provides a computationally superior 
algorithm for latent semantic retrieval, which is not based on 
SVD. In algebraic terms, the disclosed approach provides an 
advantageous compromise between the dimensionality of a 
semantic transform space and the fit of the query to docu-
ment content. The efficiency of the disclosed system comes
from building the computation of the distance between
the query vector and document clusters into the optimization
problem. Alternative embodiments of the disclosed system
may employ alternative optimization techniques. In this 
regard, a number of methods to solve the query optimization 
problem have been identified in connection with the present 
invention, including ridge regression, quadratic 
programming, and wavelet decomposition techniques. 

BRIEF DESCRIPTION OF THE SEVERAL 
VIEWS OF THE DRAWING 

The invention will be more fully understood by reference 
to the following detailed description of the invention in 
conjunction with the drawings, of which: 

FIG. 1 is a flow chart showing a series of steps for 
processing documents and processing user queries; 

FIG. 2 shows an architectural view of components in an 
illustrative embodiment; 
FIG. 3 shows steps performed during feature extraction
and information matrix (term-document matrix) formation;

FIG. 4 shows an example of an information (or term-
document) matrix;

FIG. 5 shows an example of clustering documents repre-
sented by the term-document matrix of FIG. 4, and illus-
trates some of the difficulties of performing document
clustering using LSI;

FIG. 6 shows an example of basis function expansion for
the single keyword entry "Shakespeare" in an illustrative
embodiment of the present invention;

FIG. 7 illustrates a solution of the inverse optimization
problem for a number of single term queries;

FIG. 8 shows an illustrative Graphical User Interface
(GUI);

FIG. 9 shows an interface to an internet navigation tool;

FIG. 10 is a flow chart which shows a series of steps
performed by the internet navigation tool of FIG. 9;

FIG. 11 illustrates an embodiment of a search engine GUI
for providing direct and latent information in response to a
query;

FIG. 12 illustrates the evolution of concepts or conver-
sations at browsing time; and

FIG. 13 illustrates the concept of hierarchical clustering
and categorization with inverse decision trees.

DETAILED DESCRIPTION OF THE
INVENTION

The disclosure of provisional patent application Ser. No. 
60/125,704 filed Mar. 23, 1999 is hereby incorporated by 
reference. 

As illustrated by the steps shown in FIG. 1, the disclosed
system computes a constrained measure of the similarity
between a query vector and all documents in a term-
document matrix. More specifically, at step 5 of FIG. 1, the
disclosed information retrieval system parses a number of
electronic information files containing text. In an illustrative
embodiment, the parsing of the electronic text at step 5 of
FIG. 1 may include recognizing acronyms, recording word
positions, and extracting word roots. Moreover, the parsing
of step 5 may include processing of tag information asso-
ciated with HTML and XML files, in the case where any of
the electronic information files are in HTML or XML
format. The parsing of the electronic information files per-
formed at step 5 may further include generating a number of
concept identification numbers (concept IDs) corresponding
to respective terms (also referred to as "keywords") to be
associated with the rows of the term-document matrix
formed at step 6. The disclosed system may also count the
occurrences of individual terms in each of the electronic
information files at step 5.
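
The parsing operations of step 5 (acronym recognition, word-position recording, root extraction, concept ID assignment, and term counting) might be sketched as follows. This is a minimal illustration under stated assumptions: the stopword list, the `crude_stem` suffix stripper (a stand-in for a real stemming algorithm), and the rule that an all-uppercase token is an acronym are all hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}   # illustrative list only

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse(text, concept_ids):
    """Return (term counts, term positions) for one document.

    concept_ids maps each term to an integer concept ID and is extended in place.
    """
    counts, positions = Counter(), {}
    for pos, token in enumerate(re.findall(r"[A-Za-z]+", text)):
        if token.isupper() and len(token) > 1:
            term = token                      # keep acronyms such as "NIH" as-is
        else:
            term = token.lower()
            if term in STOPWORDS:
                continue
            term = crude_stem(term)
        cid = concept_ids.setdefault(term, len(concept_ids))
        counts[cid] += 1
        positions.setdefault(cid, []).append(pos)
    return counts, positions

ids = {}
counts, positions = parse("The NIH funds searching engines and search engines", ids)
```

Because "searching" and "search" reduce to the same root in this sketch, they share one concept ID, and the recorded positions preserve where each occurrence appeared in the text.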

At step 6 of FIG. 1, the disclosed system generates a
term-document matrix (also referred to as "information
matrix") based on the contents of the electronic document
files parsed at step 5. In one embodiment, the value of each
cell in the term-document matrix generated at step 6 indi-
cates the number of occurrences of the respective term
indicated by the row of the cell, within the respective one of
the electronic information files indicated by the column of
the cell. Alternatively, the values of the cells in the term-
document matrix may reflect the presence or absence of the
respective term in the respective electronic information file.

At step 7 of FIG. 1, the disclosed system generates an
auxiliary data structure associated with the previously gen-
erated concept identification numbers. The elements of the
auxiliary data structure generated during step 7 are used to
store the relative positions of each term of the term-
document matrix within the electronic information files in
which the term occurs. Additionally, the auxiliary data
structure may be used to store the relative positions of tag
information from the electronic information files, such as
date information, that may be contained in the headers of
any HTML and XML files.
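
A minimal sketch of steps 6 and 7 follows, assuming documents have already been reduced to lists of terms. The function name `build_matrix`, the dictionary-based auxiliary structure, and the sample documents are hypothetical illustrations, not the disclosed implementation.

```python
def build_matrix(docs, binary=False):
    """Build a term-document matrix and an auxiliary position structure.

    docs is a list of token lists. Rows of the matrix correspond to terms and
    columns to documents. Cells hold occurrence counts, or 0/1 if binary=True.
    """
    terms = sorted({t for doc in docs for t in doc})
    row = {t: i for i, t in enumerate(terms)}
    matrix = [[0] * len(docs) for _ in terms]
    positions = {}                      # (term, document index) -> word offsets
    for j, doc in enumerate(docs):
        for pos, t in enumerate(doc):
            matrix[row[t]][j] = 1 if binary else matrix[row[t]][j] + 1
            positions.setdefault((t, j), []).append(pos)
    return terms, matrix, positions

docs = [["web", "search", "engine"], ["search", "query", "search"]]
terms, D, aux = build_matrix(docs)
_, D_binary, _ = build_matrix(docs, binary=True)
```

Keeping the positions outside the matrix, as in this sketch, leaves the matrix itself compact while still allowing proximity questions to be answered later.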

Weighting of the term-document matrix formed at step 6 
may be performed as illustrated at step 8 of FIG. 1. Weight- 
ing of the elements of the term-document matrix performed 
at step 8 may reflect absolute term frequency count, or any 
of several other measures of term distributions that combine 
local weighting of a matrix element with a global entropy 
weight for a term across the document collection, such as 
inverse document frequency. 
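
As one possible illustration of the weighting of step 8, the sketch below combines a local term frequency with a global inverse document frequency and then normalizes each document column to unit length. The specific formula (log of the document-count ratio, followed by column normalization) is an assumption made for this example, and the function name `tfidf_weights` is hypothetical.

```python
import math

def tfidf_weights(tf):
    """Weight a term-document count matrix.

    tf[k][i] is the count of term k in document i. Each weight is
    tf * idf, and each document column is then scaled to unit length.
    """
    n_terms, n_docs = len(tf), len(tf[0])
    idf = []
    for k in range(n_terms):
        df = sum(1 for i in range(n_docs) if tf[k][i] > 0)
        idf.append(math.log(n_docs / df) if df else 0.0)
    w = [[tf[k][i] * idf[k] for i in range(n_docs)] for k in range(n_terms)]
    for i in range(n_docs):
        norm = math.sqrt(sum(w[k][i] ** 2 for k in range(n_terms)))
        if norm > 0:
            for k in range(n_terms):
                w[k][i] /= norm
    return w

# Term 0 occurs in every document; terms 1 and 2 each occur in one document.
tf = [[5, 3, 4],
      [2, 0, 0],
      [0, 1, 0]]
w = tfidf_weights(tf)
```

A term that occurs in every document receives zero weight under this scheme, however frequent it is, while a term confined to one document dominates that document's column.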

At step 9 of FIG. 1, the disclosed system generates, in
response to the term-document matrix generated at step 6, a
term-spread matrix. The term-spread matrix generated at
step 9 is a weighted autocorrelation of the term-document
matrix generated at step 6, indicating the amount of variation
in term usage, for each term, across the set of electronic
information files. The term-spread matrix generated at step
9 is also indicative of the extent to which the terms in the
electronic information files are correlated.
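
One way to picture a term-spread matrix is as the product of the term-document matrix with its own transpose, optionally with a weight per document. The sketch below takes that reading as an assumption; the function name `term_spread` and the per-document weights are hypothetical.

```python
def term_spread(D, doc_weights=None):
    """Weighted autocorrelation S = D W D^T of a term-document matrix.

    D[a][j] is the value for term a in document j; W is a diagonal matrix of
    per-document weights (all ones if doc_weights is not given).
    """
    m, n = len(D), len(D[0])
    w = doc_weights if doc_weights is not None else [1.0] * n
    return [[sum(D[a][j] * w[j] * D[b][j] for j in range(n)) for b in range(m)]
            for a in range(m)]

D = [[1, 0],
     [1, 2],
     [0, 1]]
S = term_spread(D)
```

Under this reading, each diagonal entry summarizes how heavily a term is used across the collection, and each off-diagonal entry is nonzero only when two terms occur together in at least one document.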

At step 16, the disclosed system receives a user query 
from a user, consisting of a list of keywords or phrases. The 
disclosed system parses the electronic text included in the 
received user query at step 16. The parsing of the electronic 
text performed at step 16 may include, for example, recog- 
nizing acronyms, extracting word roots, and looking up 
those previously generated concept ID numbers correspond- 
ing to individual terms in the query. In step 17, in response 
to the user query received in step 16, the disclosed system 
generates a user query vector having as many elements as 
the number of rows in the term-spread matrix generated at
step 9. 
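
A minimal sketch of the query vector construction of step 17 follows, assuming the query has already been parsed into terms and that a dictionary of concept IDs is available. The function name `query_vector`, the unit weight per occurrence, and the sample vocabulary are hypothetical.

```python
def query_vector(query_terms, concept_ids, n_terms):
    """Build a query vector with one element per row of the term-spread matrix.

    Terms that have no concept ID (unknown to the index) are ignored.
    """
    q = [0.0] * n_terms
    for term in query_terms:
        cid = concept_ids.get(term)
        if cid is not None:
            q[cid] += 1.0
    return q

concept_ids = {"shakespeare": 0, "hamlet": 1, "theater": 2}
q = query_vector(["hamlet", "unknownword"], concept_ids, n_terms=3)
```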

Following creation of the query vector at step 17, at step 
18 the disclosed system generates, in response to the user 
query vector, an error-covariance matrix. The error-
covariance matrix generated at step 18 reflects an expected
degree of uncertainty in the initial choice of terms by the 
user, and contained within the user query. 

At step 10, in the event that the user query includes at least 
one phrase, the disclosed system augments the term- 
document matrix with an additional row for each phrase 
included in the user query. For purposes herein, a "phrase"
is considered to be a contiguous sequence of terms.
Specifically, at step 10, for each phrase in the user query, the
disclosed system adds a new row to the term-document
matrix, where each cell in the new row contains the fre- 
quency of occurrence of the phrase within the respective 
electronic information file, as determined by the frequencies 
of occurrence of individual terms composing the phrase and 
the proximity of such concepts, as determined by their 
relative positions in the electronic information files, as 
indicated by the elements of the auxiliary data structure. In 
this way the auxiliary data structure permits reforming of the 
term-document matrix to include rows corresponding to 
phrases in the user query for the purposes of processing that 
query. Rows added to the term-document matrix for han- 
dling of phrases in a user query are removed after the user 
query has been processed. 
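
The phrase handling of step 10 can be sketched as follows. For simplicity this illustration counts only exact contiguous occurrences of the phrase, using the stored word positions; a fuller treatment of proximity, as described above, is not attempted. The function name `phrase_row` and the sample documents are hypothetical.

```python
def phrase_row(phrase, positions, n_docs):
    """Count contiguous occurrences of a phrase in each document.

    positions[(term, j)] lists the word offsets of term in document j.
    """
    row = [0] * n_docs
    for j in range(n_docs):
        for start in positions.get((phrase[0], j), []):
            if all(start + off in positions.get((t, j), [])
                   for off, t in enumerate(phrase)):
                row[j] += 1
    return row

docs = [["bill", "clinton", "met", "bill", "gates"], ["clinton", "bill"]]
positions = {}
for j, doc in enumerate(docs):
    for pos, t in enumerate(doc):
        positions.setdefault((t, j), []).append(pos)

D = [[2, 1], [1, 1]]                       # rows for "bill" and "clinton"
new_row = phrase_row(["bill", "clinton"], positions, n_docs=2)
D.append(new_row)                          # temporary row for the phrase
# ... the query would be processed here ...
D.pop()                                    # remove the phrase row afterwards
```

Both documents contain both words, but only the first contains them as a contiguous phrase, which is the distinction the temporary row captures.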

Following step 10, at step 11, the disclosed system
formulates, in response to the term spread matrix, error
covariance matrix, and user query vector, a constrained
optimization problem. The lambda value chosen for the
constrained optimization problem set up in step 11 is a
Lagrange multiplier, and its specific value determines a
trade-off between the degree of fit and the stability of all
possible solutions to the constrained optimization problem.

At step 12 of FIG. 1, the disclosed system computes the
similarity between each of the electronic information files
and the user query by solving the constrained optimization
problem formulated in step 11. Specifically, in an illustrative
embodiment, the disclosed system generates a solution
vector consisting of a plurality of solution weights
("document weights"). The document weights in the solu-
tion vector each correspond to a respective one of the
electronic information files, and reflect the degree of corre-
lation of the user query to the respective electronic infor-
mation file. At step 13, the disclosed system sorts the
document weights based on a predetermined ordering, such
as in decreasing order of similarity to the user query.

At step 14, the disclosed system automatically builds a
lexical knowledge base responsive to the solution of the
constrained optimization problem computed at step 12.
Specifically, at step 14, the original term-document matrix
created at step 6 and potentially weighted at step 8, rather
than the term spread matrix computed at step 9, is cross-
multiplied with the unsorted document weights generated at
step 12 (note that the document weights must be unsorted in
this step to match the original order of columns in the
term-document matrix) to form a plurality of term weights,
one for each term. These term weights reflect the degree of
correlation of the terms in the lexical knowledge base to the
terms in the user query.

At step 15, the disclosed system returns a list of docu-
ments corresponding to the sorted document weights gen-
erated at step 13, and the lexical knowledge base generated
at step 14, to the user.
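
Steps 11 through 14 can be illustrated with a ridge-regression-style sketch, one of the solution methods named in the summary. Everything specific here is an assumption made for illustration: the closed form a = D^T (D D^T + lambda C)^{-1} q, the unweighted term-spread matrix D D^T, the identity matrix as the default error-covariance C, the value of lambda, and the function name `inverse_inference`. NumPy is assumed to be available.

```python
import numpy as np

def inverse_inference(D, q, lam=0.1, C=None):
    """Sketch of steps 11-14 under the assumptions stated above.

    D is an (m terms x n documents) array and q a length-m query vector.
    Returns (document weights, document ranking, term weights).
    """
    m = D.shape[0]
    S = D @ D.T                                  # term-spread matrix (assumed unweighted)
    C = np.eye(m) if C is None else C            # error-covariance (identity by assumption)
    a = D.T @ np.linalg.solve(S + lam * C, q)    # step 12: document weights
    ranking = np.argsort(-a)                     # step 13: decreasing similarity
    term_weights = D @ a                         # step 14: uses the unsorted weights
    return a, ranking, term_weights

# Toy matrix (illustrative): rows are car, automobile, engine, flower.
D = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])
q = np.array([1.0, 0.0, 0.0, 0.0])               # query containing only "car"
a, ranking, term_weights = inverse_inference(D, q)
```

With C equal to the identity, the closed form used here is the minimizer of ||D a - q||^2 + lambda ||a||^2, so a larger lambda favors stable, small document weights over an exact fit to the query. The term weights are computed from the unsorted document weights so that they line up with the columns of D.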

Overall System Architecture of an Illustrative Embodiment
FIG. 2 shows the overall architecture of the distributed
information retrieval system. The system consists of four
modules: Indexing 20, Storage 22, Search 24, and Query 26.
The modules may run in different address spaces on one
computer or on different computers that are linked via a
network using CORBA (Common Object Request Broker
Architecture). Within this distributed object framework,
each server is wrapped as a distributed object which can be
accessed by remote clients via method invocations. Multiple
instances of the feature extraction modules 21 can run in
parallel on different machines, and database storage can be
spread across multiple platforms.

The disclosed system may be highly modularized, thus
allowing a variety of configurations and embodiments. For
example, the feature extraction modules 21 in the indexing
module 20 may be run on inexpensive parallel systems of
machines, like Beowulf clusters of Celeron PCs, and Clus-
ters of Workstations (COW) technology consisting of dual
processor SUN Ultra 60 systems. In one embodiment, the
entire architecture of FIG. 2 may be deployed across an
Intranet, with the "inverse inference" search engine 23
residing on a Sun Ultra 60 server and multiple GUI clients
25 on Unix and Windows platforms. Alternatively, the
disclosed system may be deployed entirely on a laptop
computer executing the Windows operating system of
Microsoft Corporation.

Further as illustrated in FIG. 2, the indexing module 20 performs steps to reduce the original documents 27 and a query received from one of the clients 21 into symbolic form (i.e. a term-document matrix and a query vector, respectively). The steps performed by the indexing module 20 can be run in batch mode (when indexing a large collection of documents for the first time or updating the indices) or on-line (when processing query tokens). The disclosed architecture allows extensibility of the indexing module 20 to media other than electronic text.

The storage module 22 shown in FIG. 2 includes a Relational DataBase Management System (RDBMS) 29, for storing the term-document matrix. A search engine module 23 implements the presently disclosed inverse inference search technique. These functions provide infrastructures to search, cluster data, and establish conceptual links across the entire document database.

Client GUIs (Graphical User Interfaces) 25 permit users to pose queries, browse query results, and inspect documents. In an illustrative embodiment, GUI components may be written in the Java programming language provided by Sun Microsystems, using the standard JDK 1.1 and accompanying Swing Set. Various visual interface modules may be employed in connection with the GUI clients 25, for example executing in connection with the Sun Solaris operating system of Sun Microsystems, or in connection with the Windows NT, Windows 95, or Windows 98 operating systems of Microsoft Corporation.

Indexing
As shown in FIG. 3, a feature extraction module 21 comprises a parser module 31, a stopwording module 33, a stemming module 35, and a module for generating inverted indices 37. The output of the indexing process using the feature extraction module 21 includes a number of inverted files (Harman et al, 1992, No. 15 in Appendix A), shown as the "term-document" or "information" matrix 39. The parser 31 removes punctuation and records relative word order. In addition, the parser 31 employs a set of rules to detect acronyms before they go through the stopword 33 and stemmer 35 modules. The parser 31 can also recognize specific HTML, SGML and XML tags. The stopword 33 uses a list of non-diagnostic English terms. For purposes of example, the stemmer 35 is based on the Porter algorithm (described in Harman et al, 1992, No. 15 in Appendix A).

$$W_{ik} = \frac{tf_{ik}\, idf_k}{\sqrt{\sum_{k} (tf_{ik})^2 (idf_k)^2}}, \qquad \text{where } idf_k = \log\left(\frac{N}{n_k}\right)$$

$tf_{ik}$ is the frequency of term k in a document i, while the inverse document frequency of a term, $idf_k$, is the log of the ratio of the total number of documents in the collection (N) to the number of documents containing that term ($n_k$). As shown above, $W_{ik}$ is the weighting applied to the value in cell ik of the term-document matrix. The effect of these weightings is to normalize the statistics of term frequency counts. This step weights the term frequency counts according to: 1) the length of the document in which the term occurs and 2) how common the term is across documents. To illustrate the significance of this weighting step with regard to document length, consider a term equal to the word "Clinton". An electronic text document that is a 300 page thesis on Cuban-American relationships may, for example, have 35 counts of this term, while a 2 page biographical article on Bill Clinton may have 15 counts. Normalizing keyword counts by the total number of words in a document prevents the 300 page thesis from being prioritized over the biographical article for the user query "Bill Clinton". To illustrate the significance of this weighting step with regard to commonness of terms, consider the terms "the" and "astronaut". The former term likely occurs in 1000 documents out of 1000; the latter term may occur in 3 documents out of 1000. The weighting step prevents over-emphasis of terms that have a high probability of occurring everywhere.

Storage
As previously mentioned, the storage module 22 of FIG. 2 includes a Relational DataBase Management System (RDBMS) 29 for storing the information matrix 39 (also referred to as the "term-document" matrix) output by the indexing module 20. In a preferred embodiment, the interface between the RDBMS and the Indexing and Search modules complies with ODBC standards, making the storage module vendor independent. In one embodiment, the Enterprise Edition of Oracle 8.1.5 on Sun Solaris may be employed. However, those skilled in the art will recognize that a database management system is not an essential component of the disclosed invention. For example, in another embodiment a file system may be employed for this purpose, instead of a RDBMS.

Those skilled in the art should recognize that alternative The concept synchronizer 28 is used by a parallelized 

embodiments of the disclosed system may employ stem- implementation of the indexing module. In such an 

ming methods based on successor variety. The feature implementation, at indexing time, multiple processors parse 

extraction module provides functions 37 that generate the and index eleclronic text files in parallel. The concept 

inverted indices by transposing individual document statis- 50 synchronizer 28 maintains a look up table of concept iden- 

tics into a term-document matrix 39. tification numbers, so that when one processor encounters a 

1-he indexing performed in the embodiment shown in keyword which has already been assigned a concept iden- 

FIG. 3 also supports indexing of document attributes. !» flcatl ° n number by another processor, the same concept 

Examples of document attributes are HTML, SGML or identification number is used, instead of creating a new one. 

XML document tags, like date, author, source. Each docu- 55 In lh " wa * the co f nce P' s y" ch ™^ 28 prevents having 

ment attributes is allocated a private row for entry in the m0 ( re lhan one row for lhe same term m the '^-document 

matrix 

term-document matrix. As noted above, weighting of the Search 

elements of the termniocument matrix 39 may reflect abso- ^ e ^ Tch engifle 23 ^ based Qn a data drivcQ inductiye 

lute term frequency count, binary count, or any of several 60 lcaming modcl> of which IjS , jg an ^ , c (Rcrry ct a|f 

other measures of term distributions that combine local 1995> Na 5 in Appendix A; Landauer and Dumais, 1997. 

weighting of a matrix element with a global entropy weight Na 16 in Appendix A). Within this class of models, the 

for a term across the document collection, such as inverse disclosed system provides distinct advantages with regard 

document frequency. In an illustrative embodiment, high to: 1) mathematical procedure; 2) precision of the search; 3) 

precision recall results arc obtained with thc following 65 speed ofcomputations and 4) scalability to large information 

weighting scheme for an element d^ of the term-document matrices. The disclosed system attempts to overcome the 

matrix: problems of existing systems related to synonymy and 
polysemy using a data driven approach. In other words, instead of using a lexical knowledge base built manually by experts, the disclosed system builds one automatically from the observed statistical distribution of terms and word co-occurrences in the document database.

FIG. 4 shows an example of a term-document matrix 40, and also illustrates some of the difficulties associated with existing systems. The term-document matrix 40 of FIG. 4 is shown, for purposes of illustration, loaded with word counts for 16 keyword terms (rows 42) in 15 documents (columns 44). The example of FIG. 4 illustrates testing of latent semantic retrieval. Topics present in the document collection of FIG. 4 are "GEOGRAPHY" (documents b3, b4, b6 and b12), "THEATER" (b1, b5, b8, b9, b10, and b15), and "SHAKESPEARE" (b7 and b11). The keyword "Shakespeare" appears only in documents b7 and b11. The documents semantically related to the "THEATER" topic, however, may also be relevant to a search query which includes the single keyword "Shakespeare".

FIG. 5 shows clustering for the document collection reflected by the table of FIG. 4, as obtained using an LSI approach, as in some existing systems. The dots in each of the graphs in FIG. 5 are plane projections of individual documents into "concept space", as determined by a choice of the first few eigenvectors. Documents which deal with similar topics cluster together in this space. The key to successful semantic retrieval is to select a subspace where documents 54 which contain the keyword "Shakespeare" cluster as a subset of all documents 56 which deal with the topic of "THEATER". This is the case for the two projections shown by the graphs 50 and 52, but not for graphs 51 and 53. Graphs 51 and 53 in FIG. 5 are examples where the "SHAKESPEARE" documents 54 do not appear as a subcluster of the "THEATER" documents 56. Graphs 50 and 52, on the other hand, are examples where the "SHAKESPEARE" documents 54 appear as a subcluster of the "THEATER" documents 56. It is difficult to predetermine which choice of projection axes x-y will cause the desired effect of clustering the "SHAKESPEARE" documents as a subcluster of the "THEATER" documents. More specifically, it is difficult to predetermine how many eigenvectors, and which ones, one should use with LSI in order to achieve this result. FIG. 5 illustrates that there is no way of pre-determining the combination of axes which cause the "SHAKESPEARE" documents to appear as a subcluster of the "THEATER" documents.

LSI and Matrix Decomposition

The SVD employed by the LSI technique of equation (1) above provides a special solution to the overdetermined decomposition problem

$$D = \omega A$$
$$q = \omega \alpha$$

where D is an m×n term-document matrix, q is a query vector with m elements; the set of basis functions $\omega$ is m×k and its columns are a dictionary of basis functions $\{\omega_j,\ j = 1, 2, \ldots, k < n\}$; A and $\alpha$ are a k×n matrix and k-length vector of transform coefficients, respectively. The columns of A are document transforms, whereas $\alpha$ is the query transform. Ranking a document against a query is a matter of comparing $\alpha$ and the corresponding column of A in a reduced transform space spanned by $\omega$. The decomposition of an overdetermined system is not unique. Nonuniqueness provides the possibility of adaptation, i.e. of choosing among the many representations, or transform spaces, one of which is more suited for the purposes of the disclosed system.

In other words, LSI transforms the matrix D as $D' = U_k \Lambda_k V_k^T$ where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_k)$ and $\{\lambda_i,\ i = 1, k\}$ are the first k ordered singular values of D, and the columns of $U_k$ and $V_k$ are the first k orthonormal eigenvectors associated with $DD^T$ and $D^T D$ respectively. From this we see that $\omega = (U\Lambda)_k$ and $A = V_k^T$ $\{A_j,\ j = 1, 2, \ldots, n\}$. The columns of A are a set of norm preserving, orthonormal basis functions. If we use the cosine metric to measure the distance between the transformed documents and query, we can show that as $k \to n$

$$\cos(A_j, \alpha) = \frac{A_j^T \alpha}{\|A_j\|\,\|\alpha\|} \approx \frac{w_j}{\|w\|}$$

where $w = A^T \alpha$ is the smallest $l_2$ norm solution to the linear system $Dw = q$. Reducing the number of eigenvectors in the approximation to the inverse of D has a regularizing effect on the solution vector w, since it reduces its norm.

The present invention is based on the recognition that the measurement of the distance between the transformed documents and query, as stated above, is a special solution to the more general optimization problem

$$\min \|f(w)\|_n \quad \text{subject to} \quad Dw = q \qquad (2)$$

where $\|f(w)\|_n$ is a functional which quantifies some property of the solution vector w, n is the order of the desired norm, D is the term-document matrix and q is a query vector. The spectral expansion techniques of linear inverse theory (Parker, 1977, No. 27 in Appendix A; Backus, 1970, No. 1 in Appendix A), wavelet decomposition and atomic decomposition by basis pursuit (Chen et al, 1996, No. 9 in Appendix A) and wavelet packets (Wickerhauser, 1994, No. 35 in Appendix A) provide a number of computationally efficient methods for decomposing an overdetermined system into an optimal superposition of dictionary elements.

The disclosed search engine includes an application of the Backus and Gilbert inversion method to the solution of equation (2) above.

Inverse Inference Approach of the Disclosed System

Inverse theory departs from the multivariate analysis approach implied by LSI by modeling the information retrieval process as the impulse response of a linear system. This approach provides a powerful mechanism for control and feedback of the information process. With reference to Press et al (1997), No. 28 in Appendix A, the inverse problem is defined by the Fredholm integral equation:

$$c_i = s_i + n_i = \int r_i(x)\, w(x)\, dx + n_i$$

where $c_i$ is a noisy and imprecise datum, consisting of a signal $s_i$ and noise $n_i$; $r_i$ is a linear response kernel, and w(x) is a model about which information is to be determined. In the disclosed approach to information retrieval, the above integral equation translates as

$$q_i = q''_i + n_i = \int D_i(x)\, w(x)\, dx + n_i \qquad (3)$$

where $q_i$, an element in the query datum, is one of an imprecise collection of terms and term weights input by the user, $q''_i$ is the best choice of terms and term weights that the user could have input to retrieve the documents that are most relevant to a given search, and $n_i$ is the difference between the user's choice and such an ideal set of input terms and term weights. A statistical measure of term distribution across the document collection, $D_i(x)$, describes the system response. The subscript i is the term number; x is the document dimension (or document number, when (3) is discretized). The statistical measure of term distribution may
be simple binary, frequency, or inverse document frequency indices, or more refined statistical indices. Finally, in the present context, the model is an unknown document distance w(x) that satisfies the query datum in a semantic transform space. Equation (3) above is also referred to as the forward model equation.

The solution to equation (3) is non-unique. The optimization principle illustrated by equation (2) above considers two positive functionals of w, one of which, B[w], quantifies a property of the solution, while the other, A[w], quantifies the degree of fit to the input data. The present system operates to minimize A[w] subject to the constraint that B[w] has some particular value, by the method of Lagrange multipliers:

$$\min\ A[w] + \lambda B[w]$$

or

$$\frac{\delta}{\delta w}\left(A[w] + \lambda B[w]\right) = 0 \qquad (4)$$

where $\lambda$ is a Lagrange multiplier. The Backus-Gilbert method "differs from other regularization methods in the nature of its functionals A and B." (Press et al, 1997, No. 28 in Appendix A). These functionals maximize both the stability (B) and the resolving power (A) of the solution. An additional distinguishing feature is that, unlike what happens in conventional methods, the choice of the constant $\lambda$ which determines the relative weighting of A versus B can easily be made before any actual data is processed.

Implementation of an Illustrative Embodiment of the Inverse Inference Engine

The following description of an illustrative embodiment of the disclosed system is made with reference to the concise treatment of Backus and Gilbert inversion found in Press et al. (1997), No. 28 in Appendix A. The measurement of a document-query distance $w_c$ is performed by an illustrative embodiment in a semantic transform space. This semantic transform space is defined by a set of inverse response kernels $T_i(x)$, such that

$$w_c(x) = \sum_i T_i(x)\, q_i \qquad (5)$$

Here the document-query distances $w_c$ appear as a linear combination of transformed documents $T_i(x)$ and the terms in input query $q_i$, where i is the term number. The inverse response kernels reverse the relationship established by the linear response kernels $D_i(x)$ in the forward model equation (3). In this particular embodiment, the $D_i(x)$'s are binary, frequency, or inverse document frequency distributions. The integral of each term distribution $D_i(x)$ is defined in the illustrative embodiment as

$$H_i = \int D_i(x)\, dx$$

In finding a solution to equation (3), the disclosed system considers two functionals as in equation (4) above. As before, the functional $B[w] = \mathrm{Var}[w_c]$ quantifies the stability of the solution. The functional A[w], on the other hand, measures the fit of the solution. The degree of fit is measured as the expected deviation of a computed solution $w_c$ from the true w. The true w gives the ideal choice of query keywords q", when substituted into the forward model equation (3).

The relationship between a point estimate of $w_c$ and w can be written as

$$w_c(x) = \int \hat{\delta}(x, x')\, w(x')\, dx'$$

where $\hat{\delta}$ is a resolution kernel, whose width or spread is minimized by the disclosed system in order to maximize the resolving power of the solution. If we substitute equation (5) into equation (3) we arrive at an explicit expression for the resolution kernel $\hat{\delta}$:

$$\hat{\delta}(x, x') = \sum_i T_i(x)\, D_i(x')$$

The Backus and Gilbert method chooses to minimize the second moment of the width or spread of $\hat{\delta}$ at each value of x, while requiring it to have unit area.

These mathematical preambles lead to the following expressions for the functionals A and B:

$$A = \int (x' - x)^2\, [\hat{\delta}(x, x')]^2\, dx' = T(x) \cdot \Gamma(x) \cdot T(x)$$
$$B = \mathrm{Var}[w_c(x)] = T(x) \cdot S \cdot T(x)$$

where $\Gamma_{ij} = \int (x' - x)^2 D_i(x') D_j(x')\, dx'$ is the spread matrix, and $S_{ij}$ is the covariance matrix of the errors $n_i$ in the input query vector, computed as $S_{ij} = \mathrm{Covar}[n_i, n_j] = \delta_{ij} n_i^2$, if we assume that the errors $n_i$ on the elements of the input query are independent. By allowing for errors in the input query vector, which is based on the terms in the original query, the present system attaches a margin of uncertainty to the initial choice of terms input by the user. Since the user's initial term selection may not be optimal, the present system advantageously allows for a margin of error or a certain degree of flexibility in this regard.

The optimization problem can therefore be rewritten as

$$\min\ A[w] + \lambda B[w] = T(x) \cdot [\Gamma(x) + \lambda S] \cdot T(x) \quad \text{subject to} \quad T(x) \cdot H = 1$$

where $\lambda$ is a Lagrange multiplier. The constraint follows from the requirement that the resolution kernel $\hat{\delta}$ has unit area. Solving for T(x) we have an explicit expression for the document transform performed by the present system:

$$T(x) = \frac{[\Gamma(x) + \lambda S]^{-1} \cdot H}{H \cdot [\Gamma(x) + \lambda S]^{-1} \cdot H} \qquad (6)$$

Substituting into (5), we have an expression for the distance between documents and the query q, as performed by the disclosed system:

$$w_c(x) = \frac{q \cdot [\Gamma(x) + \lambda S]^{-1} \cdot H}{H \cdot [\Gamma(x) + \lambda S]^{-1} \cdot H} \qquad (7)$$

Note that there is no need to compute the inverse of the matrix $[\Gamma(x) + \lambda S]$ explicitly. Instead, the present system solves for some intermediate vector y in the linear system $[\Gamma(x) + \lambda S]\, y = H$, and substitutes y for $[\Gamma(x) + \lambda S]^{-1} H$ in (7). A property of the matrix $\Gamma$ which plays to the advantage of the disclosed system is that it is sparse. The particular computational method used in the vector solution of equation (7) by an illustrative embodiment is LSQR, which is an iterative method for sparse least squares, from a C implementation of the LINPACK library.
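The solve-then-substitute step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the document axis is discretized to integer document numbers, the matrices are small dense Python lists (a real system would exploit sparsity), and a plain conjugate gradient loop is swapped in for LSQR because the combined matrix is symmetric positive definite when the noise variances are positive. The names `spread_matrix`, `conjugate_gradient` and `document_distance` and the toy matrix `D` are hypothetical.

```python
def spread_matrix(D, x):
    """Gamma_ij(x) = sum over documents x' of (x' - x)^2 * D[i][x'] * D[j][x']."""
    m, n = len(D), len(D[0])
    return [[sum((xp - x) ** 2 * D[i][xp] * D[j][xp] for xp in range(n))
             for j in range(m)] for i in range(m)]


def conjugate_gradient(M, b, tol=1e-12, max_iter=1000):
    """Solve M y = b iteratively for symmetric positive definite M.

    Stand-in for LSQR (an assumption of this sketch): no inverse of M is formed.
    """
    n = len(b)
    y = [0.0] * n
    r = list(b)                      # residual b - M y, with y = 0 initially
    p = list(r)                      # search direction
    rs = sum(v * v for v in r)       # squared residual norm
    for _ in range(max_iter):
        if rs < tol:
            break
        Mp = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Mp[i] for i in range(n))
        y = [y[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Mp[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return y


def document_distance(D, q, x, lam, noise_var):
    """Equation (7): w_c(x) = q.y / H.y, where [Gamma(x) + lam*S] y = H."""
    m = len(D)
    gamma = spread_matrix(D, x)
    # S is diagonal under the independence assumption: S_ii = noise_var[i].
    M = [[gamma[i][j] + (lam * noise_var[i] if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    H = [float(sum(row)) for row in D]   # H_i: discretized integral of D_i(x)
    y = conjugate_gradient(M, H)
    return (sum(q[i] * y[i] for i in range(m))
            / sum(H[i] * y[i] for i in range(m)))


if __name__ == "__main__":
    # Toy term-document counts: 3 terms (rows) by 5 documents (columns).
    D = [[1, 0, 2, 0, 1],
         [0, 1, 1, 0, 0],
         [0, 0, 1, 2, 1]]
    query = [1.0, 0.0, 0.0]              # single-keyword query on term 0
    for doc in range(5):
        print(doc, document_distance(D, query, doc, 0.1, [1.0, 1.0, 1.0]))
```

Dividing by H·y enforces the unit-area constraint T(x)·H = 1, so a query equal to H itself yields a distance of exactly 1 at every document position.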




Optional parameters available in an illustrative embodiment are: 1) the dimensionality of the semantic transform space; 2) latent term feedback; 3) latent document list; 4) document feedback. The value of the Lagrangian multiplier λ in (7) determines the dimensionality of the transform space. The larger the value of λ, the smaller the number of concepts in transform space, and the coarser the clustering of documents. The effect of the regularization is that relevance weights are assigned more uniformly across a document collection. A relevance judgement is forced even for those documents which do not explicitly contain the keywords in the user query. These documents may contain relevant keyword structures in transform space. By contrast, an exact solution to equation (2) with λ = 0 corresponds to the rigid logic of the vector space model, where the documents are untransformed.

In an illustrative embodiment, the disclosed system achieves latency by sorting the coefficients in the solution to equation (7). Positive coefficients are associated with semantic bases which contain the keywords in the query; negative coefficients are associated with semantic bases which contain latent keywords. To understand keyword structures in this transform space, in FIG. 6 we consider the inverse solution for the input query "Shakespeare" for the example term-document matrix of FIG. 4. The graph 62 of FIG. 6 illustrates the comparison of the desired output query q (solid line 63) and the computed output query q' (indistinguishable from q) for the l2-norm minimizing solution. The output q' is computed as a linear superposition of the first seven bases (also shown in FIG. 6), ordered by decreasing coefficients |α_j|. Bases with positive α_j (basis 1 and basis 2) are shown with continuous lines. Bases with negative α_j (basis 3, basis 4, basis 5, basis 6, and basis 7) are shown with dotted lines. The positive bases contain primarily the input query keyword and contribute significantly to the query approximation. They also contain several other keywords (e.g. "theatre", "comedy") which are directly associated with the keyword "Shakespeare" across the document collection. These associated keywords must be subtracted in order for the approximation q' to match the desired output q. The negative bases accomplish this. The negative bases define partitions (or groups) of documents that contain many of the same keyword patterns found in the positive bases, this time never in direct association with the keyword "Shakespeare". Consequently, the negative bases span the space of the latent semantic documents. Latent semantic documents are documents that, while not containing any of the keywords in the user query, may contain a statistically significant number of keywords conceptually related to the keywords in the user query.

The graph 62 displaying q and q' in FIG. 6 illustrates that 
they are virtually identical, and that they accordingly appear 
as a single plot 63 in the graph 62. In this way, FIG. 6 shows 
that by forming a linear combination of bases 1 through 7, 
an approximation of q' is obtained which is virtually iden- 
tical to the user query q. 
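The sorting of coefficients by sign and magnitude can be illustrated with a short sketch. It assumes the solver has already produced one coefficient per semantic basis and a term-weight profile for each basis; the function names and all numbers below are made up for illustration and are not taken from FIG. 6.

```python
def split_by_sign(coefficients):
    """Return (positive, negative) lists of (basis, coefficient) pairs,
    each ordered by decreasing absolute coefficient."""
    ranked = sorted(coefficients.items(),
                    key=lambda item: abs(item[1]), reverse=True)
    positive = [(basis, a) for basis, a in ranked if a > 0]
    negative = [(basis, a) for basis, a in ranked if a < 0]
    return positive, negative


def top_terms(bases, selected, limit=2):
    """Collect the heaviest terms of the selected bases, in first-seen order."""
    terms = []
    for basis, _ in selected:
        heaviest = sorted(bases[basis].items(),
                          key=lambda kv: kv[1], reverse=True)
        for term, _ in heaviest[:limit]:
            if term not in terms:
                terms.append(term)
    return terms


if __name__ == "__main__":
    # Hypothetical coefficients and term weights for four bases.
    coefficients = {"basis 1": 0.8, "basis 2": 0.3,
                    "basis 3": -0.5, "basis 4": -0.2}
    bases = {
        "basis 1": {"shakespeare": 0.9, "theatre": 0.4},
        "basis 2": {"shakespeare": 0.5, "comedy": 0.3},
        "basis 3": {"theatre": 0.7, "stage": 0.6},
        "basis 4": {"comedy": 0.6, "actor": 0.5},
    }
    positive, negative = split_by_sign(coefficients)
    print("direct terms:", top_terms(bases, positive))
    print("latent terms:", top_terms(bases, negative))
```

Keeping the two lists separate mirrors the distinction drawn above between bases that contain the query keyword and bases that span the latent semantic documents.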

FIG. 7 shows the semantic keyword feedback obtained by isolating positive and negative coefficients in the truncated basis function expansion for the query approximation q_c. As shown in FIG. 7, the inverse optimization problem is solved for a number of single keyword queries q 72. In addition to a ranked list of documents 74, the disclosed inverse inference engine returns a primary list of conceptually relevant terms q_c+ 76 (terms directly related to the term in q 72) and a secondary list of potentially relevant terms q_c- 78 (terms never associated directly with the term in q 72 but found in documents that describe concepts which are semantically related to the term in q). The illustrative test results of FIG. 7 were compiled based on a random sample of 11,841 documents from the TREC (Text Retrieval Conference, a testing program for search engines sponsored by National Institute of Standards of the United States). In particular, documents in the sample are articles and newswires from the San Jose Mercury News and API.

As illustrated by FIG. 7, the disclosed inverse inference engine actually uses information derived from the data to suggest primary and secondary term lists to the user. Among the top documents returned for each query, several relevant documents may appear which do not contain the input keywords. For instance, the unabridged text of the eighth most relevant document returned from a 0.3 second search of 4,000 articles in the San Jose Mercury News (TREC), in response to the query "plane disaster" is "Upcoming shortly will be a writethru to a0516 f PM -Britain-Crash, to update with flight recorders found, experts saying both engines may have failed.". Note that, while the returned document does not contain any of the keywords in the query ("plane" and "disaster"), it is in fact a very brief newswire about a plane crash which has just occurred. These results are remarkable, considering that this is a very short document compared to the average document size in the collection.

Graphical User Interface and Internet Navigation Tool

In one embodiment of the disclosed system, a GUI is provided in the Java programming language, based on the JDK1.1 and accompanying Swing Set from SunSoft. The GUI consists of a research module for testing various implementation options outlined above, and a more sophisticated module that includes a hypernavigation tool referred to herein as a "soft hyperlink".

The snapshots in FIGS. 8 and 9 show the illustrative GUI 80 and hypernavigation tool 90. The GUI of FIG. 8 shows the top of the document list retrieved for a TREC-3 Category A (an information retrieval task performed on 742,000 documents from the TREC corpus) adhoc query.

FIG. 9 shows a prototype implementation of the soft hyperlink. The navigation tool of FIG. 9 provides freedom to move through a collection of electronic documents independently of any hyperlink which has been inserted in the HTML page. A user may click on any term in a document page, not just the terms that are hyperlinked. Let's assume that the user clicks on the word "Kremlin". The disclosed search engine executes in the background and retrieves a list of related terms. A compass display appears with pointers to the first four concepts returned by the engine. Now, the user has a choice to move from the current document to one of four document lists which cover different associations of the keyword "Kremlin": 1) "Kremlin and Yeltsin"; 2) "Kremlin and Gorbachev"; 3) "Kremlin and Russia"; 4) "Kremlin and Soviet". An additional modality of the disclosed system allows the user to jump from a current document to the next most similar document, or to a list of documents that are relevant to a phrase or paragraph selection in the current page. The "soft hyperlink" of FIG. 9 provides ease and freedom of navigation without the complexities of a search engine.
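As a rough sketch of how the compass pointers in the "Kremlin" example might be derived, the snippet below ranks related terms by a correlation score and pairs the four strongest with the clicked term. The function name `compass_queries`, the "AND" query syntax, the scores, and the fifth term "Moscow" are assumptions made for illustration only.

```python
def compass_queries(initial_term, related_terms, pointers=4):
    """Pair the clicked term with its most strongly related terms.

    related_terms is a list of (term, correlation) pairs, as might be
    returned by a search on the clicked term alone.
    """
    ranked = sorted(related_terms, key=lambda pair: pair[1], reverse=True)
    return [f"{initial_term} AND {term}" for term, _ in ranked[:pointers]]


if __name__ == "__main__":
    # Made-up correlation scores; only the term names echo the example above.
    related = [("Russia", 0.61), ("Yeltsin", 0.92), ("Soviet", 0.55),
               ("Gorbachev", 0.78), ("Moscow", 0.40)]
    for query in compass_queries("Kremlin", related):
        print(query)
```

Each resulting string stands for one document list the user can move to from the current page.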

FIG. 10 shows steps performed by an illustrative embodiment of the disclosed system for providing an Internet navigation tool. At step 100 of FIG. 10, the disclosed system captures a user indication of an initial term displayed in connection with a document, such as a word being displayed in connection with the presentation of a web page through an Internet Browser application program. The disclosed system may show that the initial term has been captured by causing the initial term to be highlighted within the user display.
Alternatively, any other form of indication may be employed, such as underlining, changing color, etc. The initial term may be any one or set of display objects within the web page, and specifically may consist of or include one or more non-hyperlinked display objects. For example, the initial term may include a phrase, a paragraph or a figure indicated by the user.

At step 102, the disclosed system issues an initial search request, via a search engine, using an initial search query consisting of the initial term. At step 104, a plurality of terms that are related to the initial search query are received as search results from the search engine. These related terms may be, for example, sorted in decreasing order of correlation to the initial term. The disclosed system may attach a relevance level to each one of a predetermined number of the initial search result terms, the relevance level reflecting a correlation to the initial term, and these relevance levels may be displayed to the user. In an illustrative embodiment, the relevance levels reflect a lexical correlation between the initial term and each respective one of the initial search result terms.

The disclosed system then selects a predetermined number of the related terms returned by the search engine. The related terms may, for example, reflect the contents of a generated lexical knowledge base. In an illustrative embodiment, the disclosed system presents the selected predetermined number of related terms to the user through a "compass" like display interface; however, this is only one of many ways in which the terms could be presented to the user. For example, in alternative embodiments, such related terms could be presented to the user through a drop-down menu or list, or some other graphical presentation.

The disclosed system then captures an indication from the user of at least one of the related terms. At step 106, in response to the selection by the user of some number of the related terms, the disclosed system issues at least one secondary search request. The search query for the secondary search request combines the selected related term or terms and the initial search term. In an illustrative embodiment, the disclosed system forms a logical AND expression including one or more initial search result terms selected by the user from the initial search result terms, together with the initial search term. The secondary search query thus includes a logical AND expression between selected ones of initial search result terms and the initial term.

The disclosed system then stores a number of secondary search result document weights at step 108, for example in decreasing order. The secondary search result document weights are received in response to the secondary searches issued at step 106, and the decreasing order in which they are stored places the documents that are most related to the secondary search query at the beginning of the list.

At step 109, the disclosed system generates a number of display objects associated with the secondary search results. In this regard, the disclosed system retrieves the electronic information file associated with the first weight in the list of sorted document weights, and displays to the user a portion of that electronic information file containing the first occurrence of the initial search term, with the initial term being highlighted or otherwise emphasized in some way. The disclosed system further retrieves, in response either to a selection or indication by the user, or in response to a predetermined number, one or more electronic information files associated with the document weights generated in response to the secondary searches issued at step 106. The disclosed system displays portions of these information files containing the first occurrence of the initial search term to the user, with the initial search term being highlighted in some manner.

In illustrative embodiments, the user interfaces of FIG. 8 and FIG. 9 may be implemented on Unix, Windows NT, Windows 95, 98 or 2000 platforms and provided with CORBA wrappers for deployment over a distributed network.

Latent Information

In the disclosed inverse solution, a positive and a negative semantic space are considered. Accordingly, the disclosed system returns a list of direct document hits (documents that contain some of the keywords in a query) and a list of latent semantic hits (documents that do not contain any of the keywords in a query, but which may be relevant to a query). The user can switch between the two lists. In an illustrative example, a search on the TREC corpus for a "crisis caused by separatist or ethnic groups" (FIG. 11) would return information on various crises in Transylvania, the Soviet Union and Albania in a first panel 110. When the user selects the latent list, as shown in a second panel 111, a vast body of information on the Lithuanian crisis is discovered, which would otherwise be missed. The articles in the second panel 111 do not contain any of the keywords in the query. Instead, for example, the language in the articles in the second panel 111 refers consistently to a struggle for "independence" and to "a linguistic minority". The disclosed search technique may locate many more relevant documents than a conventional search engine, because of its latent concept associations. Because the rankings of the positive and latent documents differ by several orders of magnitude, in an illustrative embodiment, the two lists are maintained separately. Alternatively, an empirical weighting scheme may be employed across both lists.

Speed and Memory Usage

An embodiment of the disclosed system provides query times of 7.0 sec for TREC category B (170,000 docs) and 30.5 sec for TREC category A (742,000 docs) on a SUN ULTRA 60, which compares favorably to prior systems. The disclosed system advantageously provides performance times that are sublinear. The scalability of the disclosed approach allows establishment of latent semantic links across extremely large collections, by comparison to what is possible with the SVD approach of existing systems. Memory requirements for the disclosed system vary according to the sparsity of the matrix and term distribution.

Other Commercial Applications of the Disclosed System

A search engine may only be one application of the disclosed information retrieval technology. The disclosed technology may form the basis for a variety of information retrieval tools. Some of these potential applications are outlined below.

Semantic Interpreter

The disclosed information retrieval technology may form the basis for a tool referred to as a "semantic interpreter". The semantic interpreter summarizes evolutionary trends in news articles, and performs categorization of speech or on-line chat monitoring. It is a browsing tool which allows a user to rapidly compare the content of a current document set to some earlier document set, and/or determine or summarize conceptual trends in a conversation. As illustrated in FIG. 12, the semantic interpreter may perform a search combining a series of terms (query 120) with one or more tag filters 122. The tag filters 122, for example, identify different time intervals corresponding to creation or modification times associated with various ones of the electronic text files or other types of input documents. The tag filters
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122 may further indicate specific participants in a 
conversation, or other identifiable characteristics of specific 
ones of the input documents represented by the term- 
document matrix 123. The matrix 123 is subset or parti- 
tioned by the subsctting module 124, using tag 
specification(s) 122, and the inverse inference engine pro- 
vides concept feedback specific to each of the partitions A 
126, 13 127, and C 128. This mechanism allows the user to 
compare the content of a current document set to some 
earlier document set, and determine conceptual trends. Input 
to the semantic interpreter could be electronic text from the 
Web, an electronic database, or digitized speech from a 
speech recognizer 
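
As an illustration of the subsetting step, the following Python sketch partitions a small set of documents by tag filters and computes feedback separately for each partition. The `Doc` fields, the year-based filters, and the sample data are assumptions made for illustration only. The function `concept_feedback` is a simple co-occurrence count standing in for the inverse inference engine, which it does not implement.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    counts: dict   # term -> occurrences (one column of the term-document matrix)
    created: int   # hypothetical tag: creation year
    speaker: str   # hypothetical tag: conversation participant


def subset(docs, tag_filter):
    """Keep only the columns (documents) whose tags satisfy the filter."""
    return [d for d in docs if tag_filter(d)]


def concept_feedback(docs, query_terms, top_n=3):
    """Placeholder for the inverse inference engine (an assumption):
    rank the terms that co-occur with the query terms by total count."""
    totals = {}
    for d in docs:
        if not any(t in d.counts for t in query_terms):
            continue
        for term, n in d.counts.items():
            if term not in query_terms:
                totals[term] = totals.get(term, 0) + n
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


docs = [
    Doc({"mars": 3, "probe": 2}, 1997, "ann"),
    Doc({"mars": 1, "budget": 4}, 1998, "bob"),
    Doc({"mars": 2, "budget": 1, "launch": 5}, 1999, "ann"),
]

# Tag filters defining partitions A, B, and C as time intervals.
tag_filters = {
    "A": lambda d: d.created <= 1997,
    "B": lambda d: d.created == 1998,
    "C": lambda d: d.created >= 1999,
}

query = {"mars"}
trends = {name: concept_feedback(subset(docs, f), query)
          for name, f in tag_filters.items()}
```

Comparing the entries of `trends` across partitions gives a rough picture of how the terms associated with the query shift from one time interval to the next. A filter on `speaker` instead of `created` would partition a conversation by participant in the same way.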

Intelligent Sorting of Large and Unstructured Electronic 
Collections 

As shown in FIG. 13, a recursive implementation of the
disclosed inverse inference technique leads to a fast method
for partitioning database space with groups of bases which
are hierarchically arranged in trees. A distinguishing term
132 (for example "CIA") used to characterize a cluster is
dropped from the indices of the term-document matrix 134,
after initial differentiation. The inverse inference problem is
then solved for the subset of the term-document matrix 134
which clustered around the dropped concept term 132. The
new bases are used to partition the parent cluster (CIA). This
partitioning is illustrated by the tree graph 130. The tree
graph 130 is interpreted top to bottom. The dotted line 131
indicates that the tree is very large, above and below the
relatively small section shown in FIG. 13. Above the NASA
node 135, and the CIA node 137, there may, for purposes of
example, be a parent node (not shown) GOVERNMENT
AGENCIES. The CIA node 137 is a child node of such a
GOVERNMENT AGENCIES node and a parent node of
BUCKLEY 140, FERNANDEZ 141, and the dotted line 131
indicates that there could be one or more children of CIA
137 to the right of FERNANDEZ. An illustrative example of
how child nodes may be generated from parent nodes is now
described with reference to FIG. 13. Having initially
grouped all documents pertaining to the CIA, and considering
that each document is a column in the term-document
matrix, the constrained optimization problem for the subset
of the matrix comprising only these columns may now be
solved. The CIA term can be removed, after forming the
subset and prior to solving the constrained optimization
problem, since CIA now appears in all the documents which
form the subset, and it is therefore non-diagnostic. The
operation is repeated for all clusters or all major matrix
partitions. This recursive scheme should be fast and efficient
since the inverse algorithm would be applied to progressively
smaller partitions of the term-document matrix. Tests
have shown that an inversion for a 100,000x100,000 partition
takes an implementation of the disclosed system only
about 10 seconds. In addition, this operation is parallelizable
with respect to each node in the tree 130. Such "Inverse
Decision Trees" could provide a fast and intuitive way to
analyze large collections of documents. They could start a
revolution equivalent to that caused by the introduction of
classification and regression trees in multivariate regression
analysis.

Those skilled in the art should readily appreciate that the
programs defining the functions of the present invention can
be delivered to a computer in many forms; including, but not
limited to: (a) information permanently stored on
non-writable storage media (e.g. read only memory devices
within a computer such as ROM or CD-ROM disks readable
by a computer I/O attachment); (b) information alterably
stored on writable storage media (e.g. floppy disks and hard
drives); or (c) information conveyed to a computer through
communication media for example using baseband signaling
or broadband signaling techniques, including carrier wave
signaling techniques, such as over computer or telephone
networks via a modem. In addition, while the invention may
be embodied in computer software, the functions necessary
to implement the invention may alternatively be embodied
in part or in whole using hardware components such as
Application Specific Integrated Circuits or other hardware,
or some combination of hardware components and software.

While the invention is described through the above
exemplary embodiments, it will be understood by those of
ordinary skill in the art that modification to and variation of the
illustrated embodiments may be made without departing
from the inventive concepts herein disclosed. Specifically,
while the preferred embodiments are described in
connection with various illustrative data structures, one skilled in
the art will recognize that the system may be embodied using
a variety of specific data structures. Accordingly, the
invention should not be viewed as limited except by the scope and
spirit of the appended claims.

Appendix A

Below is a list of the documents referred to in the present
disclosure:

1. Backus, G., Inference from inadequate and inaccurate
data, Proc. Nat. Acad. Sci. U.S., 65, pp. 1-7, pp. 281-287,
and 67, pp. 282-289, 1970.

2. Bartell, B. T., W. C. Cottrell, and Richard K. Belew,
Latent Semantic Indexing is an Optimal Special Case of
Multidimensional Scaling, 1996.

3. Bateman, J. A., Kasper, R. T., Moore, J. D., and Whitney,
R. A. A general organization of knowledge for NLP: The
PENMAN upper model, Technical Report, USC/
Information Sciences Institute, 1990.

4. Bernstein, M., Bolter, J. D., Joyce, M., and Mylonas, E.,
Architecture for volatile hypertext, Hypertext 92: Pro- 
ceedings of the Third ACM Conference on Hypertext, 
ACM Press, pp. 243-260, 1991 

5. Berry, M., S. T. Dumais, G. O'Brien, Using linear algebra 
for intelligent information retrieval, SIAM Review, Vol. 
37, No. 4, pp. 553-595, December 1995. 

6. Boose, J. H., A knowledge acquisition program for expert 
systems based on personal construct psychology,
International Journal of Man-Machine Studies, 23, pp. 495-525

7. Bouaud, J., Bachimont, B., Charlet, J., and Zweigenbaum,
P. Methodological Principles for Structuring an "Ontol- 
ogy". In Proc. Workshop on Basic Ontological Issues in 
Knowledge Sharing, International Joint Conference on 
Artificial Intelligence (IJCAI-95), Montreal, Canada, 
August 1995. 

8. Broglio J, Callan JP, Croft WB. INQUERY system
overview. In: Proceedings of the TIPSTER Text Program
(Phase I). San Francisco, Calif.: Morgan Kaufmann,
1994, pp 47-67. 

9. Chen, S., D. Donoho, M. Saunders, Atomic decomposi- 
tion by basis pursuit, Stanford University, Department of 
Statistics Technical Report, February 1996.

10. Deerwester, S., S. T. Dumais, G. W. Furnas, T. K.
Landauer, and R. Harshman. Indexing by latent semantic
analysis. Journal of the American Society for Information
Science, 41:391-407, 1990.

11. Dumais, S. T., Improving the Retrieval of Information
from external sources, Behavior Res. Meth., Instruments,
Computers, 23 (1991), pp. 229-236

12. Dumais, S. T., Platt, J., Heckerman, D., and Sahami, M.,
Inductive Learning Algorithms and Representations for
Text Categorization, Proceedings of ACM-CIKM98, Nov.
1998.

13. Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998). The
measurement of textual coherence with Latent Semantic
Analysis. Discourse Processes, 25, 285-307.

14. Guarino, N. Some Ontological Principles for a Unified
Top-Level Ontology. Spring Symposium on Ontological
Engineering, AAAI Tech Report, SS-97-06, American 
Association for Artificial Intelligence, March 1997. 

15. Hartman, D., R. Baeza-Yates, E. Fox, and W. Lee, 
Inverted Files, in Information Retrieval, edited by W. F. 
Frakes and R. Baeza-Yates, Prentice-Hall, 1992. 

16. Landauer, T. K., & Dumais, S. T. (1997). A solution to
Plato's problem: The Latent Semantic Analysis theory of 
the acquisition, induction, and representation of knowl- 
edge. Psychological Review, 104, 211-240. 

17. Landauer, T. K., Foltz, P. W., & Laham, D. (1998). 
Introduction to Latent Semantic Analysis. Discourse 
Processes, 25, 259-284.

18. Landauer, T. K., Laham, D., & Foltz, P. W., (1998). 
Learning human-like knowledge by Singular Value 
Decomposition: A progress report. In M. I. Jordan, M. J.
Kearns & S. A. Solla (Eds.), Advances in Neural Infor- 
mation Processing Systems 10, (pp. 45-51). Cambridge: 
MIT Press. 

19. Landauer, T. K., Laham, D., Rehder, B., & Schreiner, M.
E., (1997). How well can passage meaning be derived
without using word order? A comparison of Latent
Semantic Analysis and humans. In M. G. Shafto & P.
Langley (Eds.), Proceedings of the 19th annual meeting of
the Cognitive Science Society (pp. 412-417). Mahwah,
NJ: Erlbaum.

20. Lenat, D. B. and Guha, R. V. Building Large 
Knowledge-Based Systems. Reading, Mass.: Addison- 
Wesley, 1990. 

21. Lopresti, D., and J. Zhou, Retrieval strategies for noisy 
text, Fifth Annual Symposium on Document Analysis and 
Information Retrieval, pp. 255-269, Las Vegas, April
1996. 

22. Madigan, D. and J. York. Bayesian graphical models for 
discrete data. International Statistical Review 63, 215-32 

23. Mahesh, K. Ontology Development for Machine
Translation: Ideology and Methodology. Technical Report
MCCS-96-292, Computing Research Laboratory, New 
Mexico State University, 1996. 

24. Mahesh, K., J. Kud, and P. Dixon, Oracle at TREC-8: A
Lexical Approach, Proceedings of The Eighth Text
REtrieval Conference (TREC-8), NIST Special
Publication XXX-XXX, National Institute of Standards, 1999

25. Miller, G., WordNet: An on-line lexical database. Inter- 
national Journal of Lexicography 3(4) (Special Issue), 
1990 

26. O'Brien, G. W., Information Management Tools for 
Updating an SVD-Encode Indexing scheme, master's 
thesis, The University of Knoxville, Knoxville, Tenn., 
1994 

27. Parker, R., Understanding inverse theory, Ann. Rev. 
Earth Planet. Sci., 5, pp. 35-64, 1977.

28. Press, W. H., S. A. Teukolsky, W. T. Vetterling, and B. P.
Flannery. Numerical Recipes in C: The Art of Scientific
Computing. Cambridge University Press, 1997.

29. Rajashekar, T. B. and W. B. Croft, Combining Automatic
and Manual Index Representations in Probabilistic 
Retrieval, Journal of the American Society for Informa- 
tion Science, 46 (4):272-283, 1995. 

30. Salton, G., E. Fox, H. Wu, Extended Boolean
information retrieval, Communications ACM, 26, pp. 1022-1036,
1983. 

31. Vogel, C., An Overview of Semiotics and Potential
Applications for Computational Semiotics, First
International Workshop on Computational Semiotics, 1997

32. Vogel, C., Applying Computational Semiotics to Text
Mining, Journal of AGSI, July 1998

33. Waltz, D. L., and Pollack, J. B., Massively parallel
parsing: a strong interactive model of natural language
interpretation, Cognitive Science, 9, pp. 51-74, 1985

34. Werbach, K., Search and Searchability, Release 1.0, pp.
1-24, Jan. 15, 1999. 

35. Wickerhauser, M. V., Adapted Wavelet Analysis from
theory to software, 1994 

36. Wolfe, M. B., Schreiner, M. E., Rehder, B., Laham, D.,
Foltz, P. W., Kintsch, W., & Landauer, T. K. (1998).
Learning from text: Matching readers and text by Latent
Semantic Analysis. Discourse Processes, 25, 309-336.

What is claimed is:

1. An information retrieval method comprising the steps
of:

generating a term-document matrix to represent electronic
information files stored in a computer system, each
element in said term-document matrix indicating a
number of occurrences of a term within a respective
one of said electronic information files;

generating, responsive to said term-document matrix, a
term-spread matrix, wherein said term-spread matrix is
a weighted autocorrelation of said term-document
matrix, said term-spread matrix indicating an amount
of variation in term usage in the information files and,
also, the extent to which terms are correlated;

receiving a user query from a user, said user query
consisting of at least one term;

in response to said user query, generating a user query
vector, wherein said user query vector has as many
elements as the rows of the term-spread matrix;

generating, responsive to said user query vector, an
error-covariance matrix, wherein said error-covariance
matrix reflects an expected degree of uncertainty in the
initial choice of keywords of said user;

formulating, responsive to said term-spread matrix,
error-covariance matrix, and user query vector, a constrained
optimization problem, wherein the choice of a lambda
value equal to a Lagrange multiplier value in said
constrained optimization problem determines the
extent of a trade-off between a degree of fit and the
stability of all solutions to said constrained
optimization problem;

generating, responsive to said constrained optimization
problem, a solution vector including a plurality of
document weights, each one of said plurality of
document weights corresponding to one of each said
information files, wherein each of said document weights
reflects a degree of correlation between said user query
and the corresponding one of said information files; and

providing an information response to said user reflecting
said document weights.

2. The information retrieval method of claim 1, further
comprising:

parsing electronic text contained within said information
files, wherein said parsing includes recognizing
acronyms.

3. The information retrieval method of claim 2, wherein
said parsing further includes recording term positions.

4. The information retrieval method of claim 3, wherein
said parsing further includes processing tag information
within said information files.

5. The information retrieval method of claim 4, wherein
said tag information includes one or more HTML tags.

6. The information retrieval method of claim 5, wherein
said tag information includes one or more XML tags.

7. The information retrieval method of claim 6, wherein
said parsing further includes extracting word roots.

8. The information retrieval method of claim 7, wherein
said parsing further includes generating concept
identification numbers.

9. The information retrieval method of claim 1, further 
comprising: 

generating an auxiliary data structure, said auxiliary data 
structure being indexed by said concept identification 
numbers, and said data structure storing the positions of 
all terms contained within the information files. 

10. The information retrieval method of claim 9, wherein 
said auxiliary data structure further stores tag information 
associated with respective ones of said information files,
wherein said tag information reflects at least one character- 
istic of said respective ones of said information files. 

11. The information retrieval method of claim 10, wherein 
said tag information reflects at least one date associated with 
each respective one of said information files. 

12. The information retrieval method of claim 2, wherein 
said parsing includes counting term occurrences in each 
information file. 

13. The information retrieval method of claim 1, wherein 
said step of generating said term-document matrix includes 
generating elements in said matrix reflecting the number of 
occurrences of each one of said terms in each one of said 
information files. 

14. The information retrieval method of claim 1, further 
comprising: 

determining that said user query includes at least one 
phrase; and 

responsive to said determining that said user query
includes a phrase, adding a new row to said term-
document matrix, each element in said new row con-
taining the number of occurrences of said phrase in the
respective one of said information files.

15. The information retrieval method of claim 14, further
comprising determining said number of occurrences of said
phrase in each said respective one of said information files
by the number of occurrences of the individual terms
composing said phrase and the proximity of said terms as
indicated by the relative positions of said individual terms
contained in said auxiliary data structure.

16. The information retrieval method of claim 1, wherein 
said step of generating said term-document matrix includes 
generating each element in said term-document matrix as a 
binary weight denoting the presence or absence of a
respective one of said terms.

17. The information retrieval method of claim 1, wherein 
said step of generating said term-document matrix includes 
weighting each element in said term-document matrix by a
number of occurrences of a respective one of said terms
within a respective one of said information files and by
distribution of said respective one of said terms across the
complete set of said information files.

18. The information retrieval method of claim 1, further
comprising sorting said document weights based on a
predetermined ordering.

19. The information retrieval method of claim 18, wherein 
said predetermined ordering is decreasing order. 

20. The information retrieval method of claim 1, further
comprising automatically building a lexical knowledge base
responsive to the solution of said constrained optimization
problem, wherein said building includes cross-multiplying
said term-document matrix, rather than said term-spread
matrix, by said document weights to generate a plurality of
term weights, one for each one of said terms.

21. The information retrieval method of claim 20, further
comprising sorting said term weights based on a
predetermined ordering.

22. The information retrieval method of claim 21, wherein
said predetermined ordering is decreasing order.
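
As a reading aid for claims 1 and 18 through 22, the following Python sketch walks through one plausible numerical interpretation of the claimed steps on a three-term, three-document example. It is a sketch under stated assumptions and not the disclosed implementation. The term-spread matrix is taken as the unweighted product of the term-document matrix with its transpose, the error-covariance matrix is assumed to be the identity, the lambda value of 0.1 is arbitrary, and the constrained optimization problem is replaced by a closed-form regularized least-squares solve. The example data and all function names are hypothetical.

```python
def transpose(X):
    return [list(col) for col in zip(*X)]


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]


def solve(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(map(float, M[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


terms = ["cia", "nasa", "budget"]
# Term-document matrix: rows are terms, columns are documents (raw counts).
A = [[2, 0, 1],
     [0, 3, 0],
     [1, 1, 0]]
n_terms, n_docs = len(A), len(A[0])

S = matmul(A, transpose(A))   # assumed unweighted term-spread matrix
C = [[1.0 if i == j else 0.0 for j in range(n_terms)]
     for i in range(n_terms)]  # assumed identity error-covariance matrix
q = [1.0, 0.0, 0.0]           # user query vector for the single term "cia"
lam = 0.1                     # arbitrary fit-versus-stability trade-off

# Regularized solve standing in for the constrained optimization problem.
M = [[S[i][j] + lam * C[i][j] for j in range(n_terms)] for i in range(n_terms)]
z = solve(M, q)

# Document weights: one per information file.
doc_weights = [sum(A[t][d] * z[t] for t in range(n_terms))
               for d in range(n_docs)]
# Term weights (claim 20): term-document matrix times document weights.
term_weights = [sum(A[t][d] * doc_weights[d] for d in range(n_docs))
                for t in range(n_terms)]

# Decreasing order, as in claims 19 and 22.
doc_ranking = sorted(range(n_docs), key=lambda d: -doc_weights[d])
term_ranking = sorted(terms, key=lambda t: -term_weights[terms.index(t)])
```

In this toy example the document containing only the query term receives the largest weight, and the query term itself heads the term ranking. A larger lambda would pull all document weights toward zero, which is one way to picture the trade-off between degree of fit and stability that claim 1 recites.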

*****